One of the objectives of our paper is to evaluate whether the literature on the short-term health effects of air pollution suffers from power and bias issues.
In this document, we implement robustness checks to compute the power, type M (magnitude) and type S (sign) errors of the estimates in the studied articles. Specifically, we ask what the power, type M and type S errors would be if the true effect were only a fraction of the measured effect.
We retrieved the estimates and confidence intervals of the articles in the literature of interest in another document. Before diving into the power analysis itself, we examine the characteristics of the articles considered.
We retrieved the articles using the following query:
'TITLE(("air pollution" OR "air quality" OR "particulate matter" OR ozone OR "nitrogen dioxide" OR "sulfur dioxide" OR "PM10" OR "PM2.5" OR "carbon dioxide" OR "carbon monoxide") AND ("emergency" OR "mortality" OR "stroke" OR "cerebrovascular" OR "cardiovascular" OR "death" OR "hospitalization") AND NOT ("long term" OR "long-term")) AND "short term"'
This query returns 1534 articles. Based on the abstracts, we can briefly explore the main (unsurprising) themes of the articles:
Not all abstracts display effects and confidence intervals. We thus want to assess whether there are noticeable differences between articles for which we retrieve confidence intervals and those for which we do not. This quick exploration will also provide additional information and descriptive statistics on the whole set of articles.
Out of the 1534 abstracts returned by the query, 698 mention the terms "confidence interval", "CI", "credible interval", "95%", etc. We covered virtually all possible spellings of these terms.
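As an illustration, here is a minimal sketch of this kind of pattern matching. The regular expression below is a simplified, hypothetical version of the one actually used, and `abstracts$text` is a hypothetical column name:

```r
library(stringr)

# Simplified, hypothetical pattern covering common spellings of
# confidence/credible intervals in abstracts
ci_pattern <- regex("confidence interval|credible interval|\\bCI\\b|95\\s*%",
                    ignore_case = TRUE)

# Flag abstracts that mention one of these terms
mentions_ci <- str_detect(abstracts$text, ci_pattern)
sum(mentions_ci)
```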
In these articles, we retrieve valid effects and confidence intervals in the following proportions. Note that a number of abstracts contain "CI" without actually displaying effects and confidence intervals. Our algorithm seems to do a reasonably good job at detecting effects.
| Effect retrieved | Number of articles | Proportion |
|---|---|---|
| Yes | 607 | 0.8696275 |
| No | 91 | 0.1303725 |
This corresponds to 2079 valid effects and associated confidence intervals.
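For intuition, here is a sketch of how such effects can be pulled out of an abstract. The pattern below is a simplified, hypothetical version targeting the common "estimate (95% CI: lower, upper)" format:

```r
# Extract a point estimate and its 95% CI bounds from text such as
# "1.12 (95% CI: 1.03, 1.21)". Simplified, hypothetical pattern.
effect_pattern <- "(\\d+\\.\\d+)\\s*\\(95%\\s*CI:?\\s*(\\d+\\.\\d+)\\s*[,;-]\\s*(\\d+\\.\\d+)\\)"
matches <- str_match(abstracts$text, effect_pattern)
# Columns 2 to 4 of `matches` hold the estimate and the two CI bounds
```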
Here is a list of the articles for which at least one effect has been retrieved:
In this subsection, we investigate whether there are systematic differences between articles for which we retrieved an effect in the abstract and articles that do not display an effect or for which we did not detect the effect. Importantly, in this analysis, we consider articles that do not mention “confidence intervals” in their abstract as articles for which no effect has been retrieved.
We also use this analysis to gather general information about both the entire set of articles and those for which we retrieve an effect. For instance, it gives a sense of the main journal fields in which the articles considered are published.
We first wonder whether there are disparities in publication dates. It might be the case that displaying effects in the abstract was a feature of a given period.
Even though there are slightly more recent (2010-2020) articles for which effects are retrieved, the difference does not seem to be substantial. The first article for which an effect is detected was published in 1992. Few articles were published on this topic before that date: we only found 30 articles published before 1992. In most places, air pollution has only been measured consistently since the 1990s.
We then investigate whether there are differences in the journals in which the articles are published. The results are rather messy so we focus on journal areas and subareas.
One may notice that effects are not retrieved, i.e., not reported in the abstract or not detected, for most papers published in life sciences and in social sciences and humanities journals. This might not be too problematic as they constitute a small share of the sample. Most papers studied here were published in multidisciplinary, health sciences or physical sciences journals.
Then, we wonder whether the words used in each type of abstract differ.
Apart from a few key terms, such as "CI" and "95", there are no major differences in the terms used in the two types of abstracts.
We then look at the proportion of articles studying various pollutants.
First of all, we notice that a large share of the papers considered here study particulate matter. When there are enough articles, our propensity to detect an effect does not seem to vary much with the type of pollutant. Note that if an article considers several pollutants, it appears several times in this graph.
We then analyze whether there are reporting or detection differences in terms of the outcome considered in the study.
Most articles studied here are interested in mortality. There is no tremendous difference in the reporting of effects and our ability to detect them based on the outcome considered. The proportion of estimates retrieved is however lower for mortality than for emergency admissions.
Some articles focus their analysis on sub-populations such as infants or the elderly. We are able to identify a fraction of these articles when they mention these terms in the abstract. When these terms are not mentioned, either the entire population is considered or we are not able to detect the subgroup considered. The number of articles for which a subpopulation is indicated is as follows.
| Subpopulation indicated | Number of articles |
|---|---|
| Yes | 174 |
| No or unknown | 1360 |
Looking in more detail into the detection of effects, we get the following pattern:
We then look at the number of observations, the length of the study and the number of cities considered. Importantly, we only retrieve this information for a very limited subset of articles.
| Missing | Length of the study | Number of cities | Number of observations |
|---|---|---|---|
| False | 543 | 676 | 313 |
| True | 991 | 858 | 1221 |
Our analysis is therefore to be taken with caution as there is a critical lack of information on these characteristics.
We notice that there are large variations in the number of observations across the studies considered. However, there do not seem to be large differences along this dimension between articles for which an effect is retrieved and those for which it is not.
Now that we have quickly compared the articles for which we retrieve an effect and those for which we do not, we can dig further into the analysis of the estimates retrieved.
In this section, we briefly analyze the effects retrieved. First, we look into the proportion of significant effects.
| Significant | Number of effects | Proportion |
|---|---|---|
| No | 118 | 0.0567581 |
| Yes | 1961 | 0.9432419 |
Most of the effects retrieved here are significant. Researchers mention their key findings in the abstract and therefore probably do not report estimates for which the null hypothesis of no effect cannot be rejected. Only a very small proportion of articles report no statistically significant estimate in their abstract:
| At least one significant estimate | Number of articles | Proportion |
|---|---|---|
| No | 7 | 0.0110585 |
| Yes | 626 | 0.9889415 |
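For intuition, here is a sketch of how significance can be read off the retrieved intervals; `effects` is a hypothetical data frame with columns `ci_low`, `ci_high` and `null_value` (the null of each effect measure: 0 on an additive scale, 1 for relative risks or odds ratios):

```r
# An estimate is significant at the 5% level when its 95% CI excludes
# the null value of its effect measure
significant <- effects$ci_low > effects$null_value |
  effects$ci_high < effects$null_value
prop.table(table(significant))
```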
We then look into the distribution of the t-scores.
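Abstracts report confidence intervals rather than standard errors, so we back out an approximate t-score from the 95% CI. A sketch assuming normality, reusing the hypothetical `effects` columns plus a hypothetical `estimate` column:

```r
# Recover the standard error from the 95% CI, then form the t-score
se <- (effects$ci_high - effects$ci_low) / (2 * qnorm(0.975))
t_score <- effects$estimate / se
hist(t_score, breaks = 100)
```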
There seems to be some bunching of t-scores just above 1.96. In this analysis, we only consider estimates reported in the abstracts. Authors may report only significant estimates in their abstracts even though they also report non-significant estimates in the body of the article, which might explain this bunching. We need to investigate this further to understand whether it is evidence of publication bias, for instance by reproducing the present analysis on the full texts and not only on the abstracts.
We then plot the distribution of the signal-to-noise ratio, i.e., the ratio of the point estimate to the width of the confidence interval.
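A sketch of this computation, with the same hypothetical columns:

```r
# Signal-to-noise ratio: point estimate over the width of the 95% CI
snr <- abs(effects$estimate) / (effects$ci_high - effects$ci_low)
# Deciles of the distribution, as in the table below
quantile(snr, probs = seq(0, 1, by = 0.1))
```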
The graph is of course analogous to the previous one. It however informs us that in a large share of the studies, the magnitude of the noise is larger than the magnitude of the effect. Looking in more detail into the distribution of the signal-to-noise ratio, we notice that for 40% of the estimates considered here, the magnitude of the noise exceeds that of the signal.
| Signal-to-noise ratio | Percentage of estimates with a lower signal-to-noise ratio |
|---|---|
| 0.0322581 | 0% |
| 0.5265798 | 10% |
| 0.6374026 | 20% |
| 0.7857143 | 30% |
| 1.0000000 | 40% |
| 1.3108108 | 50% |
| 2.1190476 | 60% |
| 4.6000000 | 70% |
| 10.0000000 | 80% |
| 23.9457143 | 90% |
| 834.8333333 | 100% |
We then turn to the power analysis itself. The objective is to evaluate the power, type M and type S errors for each estimate.
To compute these values, we would need to know the true effect size. Yet, true effects are of course unknown. One solution could be to use estimates from the literature and meta-analyses as best guesses for the true value. However, in the setting of this systematic literature review, it is very challenging to retrieve exactly what is measured in each analysis since there is no standardized way of reporting the results. One study may for instance claim that a 10 \(\mu g/m^{3}\) increase in PM2.5 concentration leads to an increase of x% in hospital admissions over the course of a year, while another may state that a 2% increase in ozone concentration increases the number of deaths by 3 over a month. Fortunately, for each estimate retrieved, even though we do not know what is measured, we can evaluate the precision with which it is estimated.
To circumvent the fact that we do not know the actual effect size, we follow the strategy suggested by Gelman and Carlin (2014). We consider different potential "true" effect sizes and run robustness checks. This enables us to investigate what the power, type M and type S errors would be if the true effect were only a fraction of the measured effect, and thus to assess whether the design of the study is good enough to detect a smaller effect. If assuming that the true effect is 3/4 of the measured effect yields a power of 30%, there is probably a key issue with the design of this study: it would detect this (non-zero) effect only 30% of the time.
Of course, there is no reason to think a priori that a given effect is overestimated. These results are therefore only indicative.
To carry out this analysis, we use the retrodesign package, which computes post hoc design calculations (power, type M and type S errors). We run the function retro_design() for several hypothesized effect sizes.
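For transparency, here is a minimal sketch of the calculation this kind of function performs, following Gelman and Carlin (2014) under a normal approximation. The function name `retro_sketch()` and the recovery of the standard error from the CI are our own conventions, not the package's API:

```r
# A: hypothesized (positive) true effect; s: standard error, which we
# recover from a reported 95% CI as (upper - lower) / (2 * 1.96)
retro_sketch <- function(A, s, alpha = 0.05, n_sims = 10000) {
  z <- qnorm(1 - alpha / 2)
  # Probability of a significant estimate in each direction
  p_hi <- 1 - pnorm(z - A / s)
  p_lo <- pnorm(-z - A / s)
  power <- p_hi + p_lo
  # Type S: share of significant estimates with the wrong sign
  type_s <- p_lo / power
  # Type M: expected exaggeration ratio among significant estimates,
  # computed by simulation
  est <- A + s * rnorm(n_sims)
  significant <- abs(est) > s * z
  type_m <- mean(abs(est)[significant]) / A
  list(power = power, type_s = type_s, type_m = type_m)
}

# Example: design calculations if the true effect were half the
# measured effect, for an estimate of 1 with standard error 0.4
retro_sketch(A = 0.5 * 1, s = 0.4)
```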
In a first part, we carry out our analysis on the whole set of abstracts. We notice some heterogeneity across articles: some display high power while others display lower power. Thus, in a second part, we look in more detail at articles displaying low power.
We start by computing the average and median power, type M and type S errors for a set of “true” effects.
| True effect | Power (mean) | Power (median) | Type M (mean) | Type M (median) | Type S (mean) | Type S (median) |
|---|---|---|---|---|---|---|
| 1% of the measured effect | 0.1040123 | 0.0503025 | 59.238310 | 45.512417 | 0.3406611 | 0.4402240 |
| 5% of the measured effect | 0.2507788 | 0.0575954 | 12.000707 | 9.170205 | 0.1944561 | 0.2311913 |
| 10% of the measured effect | 0.3387450 | 0.0807551 | 6.159897 | 4.664392 | 0.1141361 | 0.0827711 |
| 33% of the measured effect | 0.5389199 | 0.3959030 | 2.189054 | 1.573966 | 0.0167556 | 0.0003240 |
| 50% of the measured effect | 0.6502209 | 0.7288154 | 1.639431 | 1.176970 | 0.0068538 | 0.0000041 |
| 67% of the measured effect | 0.7428472 | 0.9309295 | 1.387917 | 1.041465 | 0.0037939 | 0.0000000 |
| 75% of the measured effect | 0.7795037 | 0.9708755 | 1.314497 | 1.017744 | 0.0030488 | 0.0000000 |
| 90% of the measured effect | 0.8365592 | 0.9961457 | 1.219197 | 1.002488 | 0.0021474 | 0.0000000 |
| 100% of the measured effect | 0.8668059 | 0.9992596 | 1.175799 | 1.000497 | 0.0017509 | 0.0000000 |
| Lower bound of the CI | 0.6465227 | 0.9262411 | 4.126059 | 1.044272 | 0.0406714 | 0.0000000 |
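A sketch of how such a table can be produced, reusing `retro_sketch()` from above; `effects$estimate` and `effects$se` are hypothetical column names, with `se` recovered from the CI as before:

```r
# Fractions of the measured effect used as hypothesized true effects
fractions <- c(0.01, 0.05, 0.10, 0.33, 0.50, 0.67, 0.75, 0.90, 1.00)

summary_table <- t(sapply(fractions, function(f) {
  # Design calculations for every estimate, taking absolute values so
  # that the sign convention of retro_sketch() holds
  res <- mapply(function(a, s) unlist(retro_sketch(f * abs(a), s)),
                effects$estimate, effects$se)
  c(mean_power    = mean(res["power", ]),
    median_power  = median(res["power", ]),
    mean_type_m   = mean(res["type_m", ]),
    median_type_m = median(res["type_m", ]),
    mean_type_s   = mean(res["type_s", ]),
    median_type_s = median(res["type_s", ]))
}))
rownames(summary_table) <- paste0(100 * fractions, "% of the measured effect")
```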
Then, we explore graphically the distribution of power, type M and type S errors across simulations and for different magnitudes of the true effect.
A large chunk of articles displays high power and low rates of type M and type S errors in each robustness check. However, a non-negligible number of articles display lower power and/or some evidence of type M error. Type S error does not seem to be an important issue in this literature. We investigate potential drivers of low power and type M errors further in the next subsection.
Note that, due to some outliers, we use a log scale for type M errors. Without the log scale, we restrict our sample to type M errors lower than 2.5 (95% of our sample, even when we assume that the true effect is only 1/3 of the estimated one).
We find that, even if the measured effect is the true effect, there is some risk of type M error.
Alternatively, we can also look at what the power, type M and type S errors would be if the true effect were equal to the lower bound of the confidence interval.
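A sketch of this variant, reusing `retro_sketch()`:

```r
# Power if the true effect equalled the lower bound of the 95% CI
# (absolute values used, which simplifies sign conventions)
power_lb <- mapply(function(lb, s) retro_sketch(abs(lb), s)$power,
                   effects$ci_low, effects$se)
```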
The ECDFs also provide useful information on the distribution of power, type M and type S errors across studies.
We notice that about 50% of studies would be underpowered at the conventional 80% level if we considered that the true effect was half the measured effect.
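This share can be read directly off the ECDF; as a sketch:

```r
# Per-estimate power if the true effect were half the measured effect
power_half <- mapply(function(a, s) retro_sketch(0.5 * abs(a), s)$power,
                     effects$estimate, effects$se)
# Share of estimates below the conventional 80% power threshold
mean(power_half < 0.80)
# Empirical cumulative distribution function of power
plot(ecdf(power_half))
```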
For the ECDFs too, we can look at what the power, type M and type S errors would be if the true effect were equal to the lower bound of the confidence interval.
Then, we look at how type M and type S errors evolve with power for the estimates considered.
There is a one-to-one relationship between power and type M and type S errors. Not surprisingly, type M and type S errors skyrocket in studies with low power.
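Indeed, power, type M and type S errors are all deterministic functions of the ratio of the true effect to the standard error, so the relationship can be traced on a grid; a sketch reusing `retro_sketch()`:

```r
# Trace type M error against power on a grid of effect-to-standard-error
# ratios (s is normalized to 1)
ratios <- seq(0.1, 5, by = 0.1)
curves <- t(sapply(ratios, function(r) unlist(retro_sketch(A = r, s = 1))))
plot(curves[, "power"], curves[, "type_m"], type = "l",
     xlab = "Power", ylab = "Type M error (exaggeration ratio)")
```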
We then investigate how average power, type M and type S errors evolve as the hypothesized true effect varies as a proportion of the measured effect.
Power decreases and type M and type S errors skyrocket for small values of the true effect (as a proportion of the measured effect). In addition, on average, if for each paper in the literature the true effect were 3/4 of the measured effect, power would be lower than the usual 80%. Type S errors only seem to be an issue for small values of the true effect as a proportion of the measured effect. Type M errors seem to be more consistently problematic. The sharp rise in the previous graph makes it difficult to read the values of type M error when the true effect is not a small proportion of the measured effect. We therefore zoom in.
We notice that, on average in the literature, treatment effects are overestimated, even for large values of the true effect. This result might be driven by some outliers. We thus look at the evolution of the median type M error with the hypothesized true effect size.
We notice that the issue is much less important when looking at the median. This suggests some heterogeneity in terms of power in the literature.
To confirm this, we look into the evolution of the distribution of power with the proportion of effect size considered.
The overall distribution of power seems almost bimodal: for most studies, power is either very high or very low.
It might also be interesting to look at how power, type M and type S errors evolve over time, i.e., with publication date.
There does not seem to be a clear trend in the evolution of power and type S error. However, type M error seems to have peaked in the 2010s and to be decreasing again recently.
In the previous section, we noticed that a non-negligible number of studies seemed to suffer from low power and associated type M error. We consider that an estimate has low power if its computed power is lower than 80% when the true effect is assumed to be 3/4 of the measured effect. 80% is the threshold usually used in power analyses, but 3/4 is arbitrary and could easily be changed in a robustness check. Following this criterion, the number and proportion of estimates with low power is as follows:
| Power | Number of estimates | Proportion |
|---|---|---|
| Adequate power | 1274 | 0.6127946 |
| Low power | 805 | 0.3872054 |
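A sketch of the classification, reusing `retro_sketch()`:

```r
# Power of each estimate when the true effect is assumed to be 3/4 of
# the measured effect
power_75 <- mapply(function(a, s) retro_sketch(0.75 * abs(a), s)$power,
                   effects$estimate, effects$se)
# An estimate has low power when this power falls below 80%
low_power <- power_75 < 0.80
prop.table(table(low_power))
```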
We investigate the particularities of the articles with low power. We start by reproducing the analyses used to compare articles for which we retrieved an effect and those for which we did not. First, we look into the distribution of publication dates.
It seems that fewer articles with low power have been published recently, in comparison to articles with adequate power. This confirms our previous finding. We then look into the distribution of articles across journals.
Interestingly, some journals, such as "Science of the Total Environment", the "International Journal of Occupational Medicine and Environmental Health", the "Cochrane Database of Systematic Reviews", "Environmental Science and Pollution Research" and the "Journal of Exposure Science and Environmental Epidemiology", publish a large share of low-power studies. On the contrary, BMJ Open publishes very few low-power studies.
Here also, grouping the journals into big main themes could be more instructive.
There does not seem to be a clear trend in the proportion of articles with low power. If anything, it has slightly decreased in the last decade.
We also look into potential disparities in terms of pollutants.
There does not seem to be stark differences by pollutant type.
We then compare the articles in terms of the outcome considered (mortality or hospital admissions).
There is virtually no difference along this dimension.